UNIVERSITY OF SHARJAH 


TUTORIALS 3 
EXPLORATORY DATA ANALYSIS 


AHMED HOSSAIN, PhD


Exploratory Data Analysis 




Biostatistics 


DATA SUMMARIES 


• Tabular: frequencies, relative frequencies, etc.


• Graphical: line graphs, bar charts, histograms, scatter plots, box plots, pie charts, etc.

Biostatistics for categorical data


DATA SUMMARIES 


FREQUENCY TABLES
Frequency tables are used to summarize:
• Nominal or ordinal data having natural categories
• Discrete or continuous data, usually after the data have been grouped into categories


. tabulate gender

     gender |      Freq.     Percent        Cum.
------------+-----------------------------------
     Female |        965       57.07       57.07
       Male |        726       42.93      100.00
------------+-----------------------------------
      Total |      1,691      100.00


. tabulate smoke

      smoke |      Freq.     Percent        Cum.
------------+-----------------------------------
         No |      1,270       75.10       75.10
        Yes |        421       24.90      100.00
------------+-----------------------------------
      Total |      1,691      100.00


DATA SUMMARIES: BIVARIATE TABLE 


. tabulate gender smoke, row

+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

           |         smoke
    gender |        No        Yes |     Total
-----------+----------------------+----------
    Female |       731        234 |       965
           |     75.75      24.25 |    100.00
-----------+----------------------+----------
      Male |       539        187 |       726
           |     74.24      25.76 |    100.00
-----------+----------------------+----------
     Total |     1,270        421 |     1,691
           |     75.10      24.90 |    100.00
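
The slides produce these tables with Stata's tabulate command. As a rough cross-check, the same percentages can be recomputed from the cell counts. The following is a minimal Python sketch, not the author's method: the function names frequency_table and row_percentages are made up for illustration, and the individual records are rebuilt from the four cell counts shown in the bivariate table.

```python
from collections import Counter

# Rebuild one (gender, smoke) record per person from the four cell counts.
records = (
    [("Female", "No")] * 731 + [("Female", "Yes")] * 234
    + [("Male", "No")] * 539 + [("Male", "Yes")] * 187
)


def frequency_table(values):
    """Return (category, frequency, percent, cumulative percent) rows."""
    counts = Counter(values)
    total = sum(counts.values())
    rows, running = [], 0
    for category in sorted(counts):
        running += counts[category]
        rows.append((category, counts[category],
                     round(100 * counts[category] / total, 2),
                     round(100 * running / total, 2)))
    return rows


def row_percentages(pairs):
    """Return {row: {column: percent of that row}} for a two-way table."""
    cells = Counter(pairs)
    row_totals = Counter(row for row, _ in pairs)
    return {row: {col: round(100 * n / row_totals[row], 2)
                  for (r, col), n in cells.items() if r == row}
            for row in row_totals}


gender_table = frequency_table([gender for gender, _ in records])
smoke_by_gender = row_percentages(records)
print(gender_table)
print(smoke_by_gender)
```

Row percentages answer "what share of each gender smokes?", which is why each row of the bivariate table sums to 100.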


DATA SUMMARIES: LINE GRAPH/ DIAGRAM 


• Used for categorical variables to show the frequency or proportion in each category.
• Translates the data from frequency tables.


[Figure: Maternal mortality rate of (country), 1960-2000. Line graph; x-axis: Year (1960 to 2000); y-axis: MMR/1000.]


DATA SUMMARIES: BAR CHART/ DIAGRAM 


• Used for categorical variables to show the frequency or proportion in each category.
• Translates the data from frequency tables.


[Figure: Bar chart of maternal mortality rate by year.]


DATA SUMMARIES: BAR CHART AND LINE GRAPH 


[Figure: Potatoes consumed, shown as a bar chart and as a line graph; y-axis: Kilos.]


[Figure: Beverage consumption in Canada.]

Interpreting data correctly is important.


WHAT IS WRONG HERE? 


"21% of the boys and 30% of the girls
support me; therefore I'll get 51%
of the vote."
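
The mistake is adding percentages that have different bases. The overall level of support is a weighted average of the two group percentages, so it must lie between 21% and 30%. A minimal Python sketch follows; the class sizes are hypothetical (the slide does not give them) and the function name overall_support is made up for illustration.

```python
def overall_support(groups):
    """groups: list of (group size, percent supporting). Returns overall percent."""
    total = sum(size for size, _ in groups)
    supporters = sum(size * pct / 100 for size, pct in groups)
    return 100 * supporters / total


# Hypothetical class sizes (not given on the slide): 100 boys and 100 girls.
equal_groups = overall_support([(100, 21), (100, 30)])

# Hypothetical alternative: 50 boys and 150 girls.
more_girls = overall_support([(50, 21), (150, 30)])

print(equal_groups)  # 25.5
print(more_girls)    # 27.75
```

Whatever the group sizes, the result never reaches 51%.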


DATA SUMMARIES: BARPLOT 


• Find a limitation of this bar plot in terms of interpretation.


[Figure: Bar plot of the number of students by visual acuity and parents' myopia; the legend includes "Good".]


DATA SUMMARIES: BARPLOT 


Always display bar graphs with percentages. 


[Figure: Percentage of students by visual acuity and parents' myopia; x-axis: Parents Myopia; y-axis: % of Students.]


DATA SUMMARIES: PIE CHART 


• Used to express information from a frequency summary table of categorical data.
• A circle is divided into slices; the number of slices corresponds to the number of categories.
• Relative frequencies (percentages) make it easier to create a proportional pie chart.


[Figure: Pie chart of the distribution of genotypes.]


DATA SUMMARIES: LIMITATIONS OF PIE CHARTS 


Pie charts are hard to read when a categorical variable has many categories, or when a
bivariate table produces many category combinations.

For quantitative (discrete or continuous) data


STEM-AND-LEAF PLOTS (STEMPLOTS) 


• Used to visualize the distribution (shape, center, range, variation) of continuous
variables in small data sets.
• Plot all data points and arrange them in rank order:

05 11 21 24 27 28 30 42 50 52

[Figure: Rotated stemplot (for demonstration purposes).]
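
To make the construction concrete, here is a minimal Python sketch that builds a stemplot for the ten values listed on this slide. It assumes the tens digit is the stem and the units digit is the leaf; the function name stemplot is made up for illustration.

```python
from collections import defaultdict


def stemplot(values):
    """Return stem-and-leaf lines for non-negative integers (stem = tens, leaf = units)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    # Walk every stem between the smallest and largest so that gaps stay visible.
    stems = range(min(leaves), max(leaves) + 1)
    return [f"{stem} | {' '.join(str(leaf) for leaf in leaves[stem])}".rstrip()
            for stem in stems]


data = [5, 11, 21, 24, 27, 28, 30, 42, 50, 52]
lines = stemplot(data)
print("\n".join(lines))
# 0 | 5
# 1 | 1
# 2 | 1 4 7 8
# 3 | 0
# 4 | 2
# 5 | 0 2
```

Read sideways, the rows of leaves look like a histogram with a bin width of 10, which is what the rotated stemplot shows.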


DATA SUMMARIES: HISTOGRAM 


• Used to visualize the distribution (shape, center, range, variation) of continuous
variables in large data sets.
• "Bin size" is important.


[Figure: Histogram; y-axis: Number of individuals.]


DATA SUMMARIES: HISTOGRAM 


Positive skew: The right tail is longer; the mass of the distribution is concentrated on 
the left of the figure. The distribution is said to be right-skewed, right-tailed, or skewed 
to the right. 


[Figure: Histogram of the worldwide adolescent fertility rate, 2006.]


DATA SUMMARIES: HISTOGRAM 


Negative skew: The left tail is longer; the mass of the distribution is concentrated on 
the right of the figure. The distribution is said to be left-skewed, left-tailed, or skewed to 
the left. 
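
The direction of skew can also be checked numerically. The sketch below uses the moment coefficient of skewness, which is one common definition among several and is not taken from the slides. It is positive for right-skewed data, negative for left-skewed data, and zero for symmetric data. The data sets are hypothetical.

```python
from statistics import mean, median


def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5, using population moments."""
    n = len(values)
    centre = mean(values)
    m2 = sum((x - centre) ** 2 for x in values) / n
    m3 = sum((x - centre) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Hypothetical data: one large value stretches the right tail.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]
# Mirror image of the same data: the long tail is now on the left.
left_skewed = [11 - x for x in right_skewed]

print(skewness(right_skewed) > 0)                   # True
print(skewness(left_skewed) < 0)                    # True
print(mean(right_skewed), median(right_skewed))     # mean 3.5 exceeds median 3
```

In this example the long tail pulls the mean toward it, so the mean lies on the tail side of the median.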




DATA SUMMARIES: SKEWNESS 


[Figure: Three distributions; x-axis: Time, Years. Panels: No Skewness (symmetrical); Positive Skewness (left-modal); Negative Skewness (right-modal).]




EFFECT OF BIN SIZE ON HISTOGRAM 


[Figure: Length of time of service calls at a bank, drawn as histograms with different bin sizes; y-axis: Count of calls. Annotation: 7.6% of all calls are ≤ 10 seconds long.]
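
The effect of bin size can be reproduced with a few lines of code. This is a minimal Python sketch with hypothetical call lengths (not the bank data from the figure); the function name histogram_counts is made up for illustration.

```python
from collections import Counter


def histogram_counts(values, bin_width):
    """Count values per bin [k*w, (k+1)*w); returns {bin start: count}."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical call lengths in seconds (not the bank data from the figure).
call_lengths = [4, 7, 9, 35, 60, 62, 75, 110, 118, 125, 190, 240, 410]

narrow = histogram_counts(call_lengths, 10)   # 10-second bins
wide = histogram_counts(call_lengths, 100)    # 100-second bins

print(narrow)
print(wide)  # {0: 7, 100: 4, 200: 1, 400: 1}
```

With 10-second bins the cluster of very short calls shows up as its own bar; with 100-second bins it is absorbed into the first bar and can no longer be seen.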


DATA SUMMARIES: LOCATION AND SHAPE 


MEASURES OF CENTRAL TENDENCY: Mean, median and mode.
MEASURES OF SPREAD: Range, interquartile range, variance and standard deviation.


CENTRAL LOCATION: MEAN AND MEDIAN


MEAN:
To calculate the average $\bar{x}$ of a set of observations, add their
values and divide by the number of observations:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$


MEDIAN is the exact middle value of the ordered observations.
• If there is an odd number of observations, find the middle value.
• If there is an even number of observations, find the middle two values and average them.


Example 


Some data: 
Age of participants: 17 19 21 22 23 23 23 38 


Median = (22+23)/2 = 22.5 
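
The same example can be checked with Python's standard statistics module. This is a minimal sketch using the eight ages listed above; the variable names are chosen for illustration.

```python
from statistics import mean, median, mode

ages = [17, 19, 21, 22, 23, 23, 23, 38]

age_mean = mean(ages)      # sum of the values divided by n: 186 / 8
age_median = median(ages)  # average of the two middle values: (22 + 23) / 2
age_mode = mode(ages)      # most frequent value

print(age_mean, age_median, age_mode)  # 23.25 22.5 23
```

The single large value (38) pulls the mean above the median, while the median and mode are unaffected by it.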


QUESTION 


Which of the following orders correctly represents the measures of central tendency for the distribution shown here? 


[Figure: Distribution curve; y-axis: Relative Frequency; x-axis: Units of Measure, with points A, B and C marked from left to right.]


a. A: mean, B: median, C: mode 
b. A: mode, B: mean, C: median 
c. A: median, B: mode, C: mean
d. A: median, B: mean, C: mode 
e. A: mode, B: median, C: mean 
f. None of these orders are correct. 


SPREAD: VARIANCE AND STANDARD DEVIATION 


• The term spread is an informal way to refer to the dispersion or variability of
data points. The following figure shows distributions with different variability.


• Populations 1 and 2 have the same central location, but population 2 has
greater spread (variability).


[Figure: Distributions with different spreads: Population 1 and Population 2.]


SPREAD: VARIANCE AND STANDARD DEVIATION 


VARIANCE: Average of the squared deviations of the values from the mean (for a sample, the sum is divided by n - 1).

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$

• Observations contribute more to the variance the farther they are from
the mean.

STANDARD DEVIATION: The standard deviation is simply the square root of the variance.
• For approximately bell-shaped (normal) data, roughly 68% of the observations lie within 1
standard deviation of the average.
• For such data, about 95% of the observations lie within 2 standard deviations of the
average.
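
As a worked check of the definitions, the minimal Python sketch below computes the sample variance and standard deviation by hand and compares them with the standard statistics module. It reuses the eight participant ages from the earlier median example; the variable names are chosen for illustration.

```python
from math import sqrt
from statistics import stdev, variance

ages = [17, 19, 21, 22, 23, 23, 23, 38]

n = len(ages)
x_bar = sum(ages) / n                                      # mean = 23.25
s2_manual = sum((x - x_bar) ** 2 for x in ages) / (n - 1)  # 281.5 / 7
s_manual = sqrt(s2_manual)

# Count the observations within one standard deviation of the mean.
within_one_sd = sum(1 for x in ages if abs(x - x_bar) <= s_manual)

print(round(s2_manual, 2), round(s_manual, 2))  # 40.21 6.34
print(variance(ages), stdev(ages))              # library values agree with the manual ones
print(within_one_sd, "of", n)                   # 7 of 8
```

Here 7 of 8 values (87.5%) fall within one standard deviation, which illustrates that the 68% figure is a rule of thumb for bell-shaped data rather than a guarantee for every data set.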


VARIANCE: WHICH ONE HAS THE SMALLER STANDARD DEVIATION?


SPREAD: QUARTILES AND INTER QUARTILE RANGE 


25% | 25% | 25% | 25%


Q1: The first quartile, Q1, is the value for which 25% of the observations
are smaller and 75% are larger.
Q2: Q2 is the same as the median (50% are smaller, 50% are larger).
Q3: Only 25% of the observations are greater than the third quartile.
IQR: It is the difference between the third and first quartiles.
EXAMPLE: Graduate student ages: 27, 28, 31, 35, 35, 40, 42, 43, 50, 52.
• P50 = Q2 = average of the middle two observations =
(35+40)/2 = 37.5 years.
• P25 = Q1 = middle observation of the lower 5 observations =
31 years.
• P75 = Q3 = middle observation of the upper 5 observations =
43 years.
• IQR = Q3 - Q1 = 43 - 31 = 12 years.
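
The median-of-halves rule used in this example can be sketched in Python as follows. The function name quartiles is made up for illustration. Statistical packages use several slightly different quartile conventions, so their output may differ a little from these values.

```python
from statistics import median


def quartiles(values):
    """Q1, Q2, Q3 by the median-of-halves rule (middle value excluded when n is odd)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower, upper = ordered[:half], ordered[-half:]
    return median(lower), median(ordered), median(upper)


ages = [27, 28, 31, 35, 35, 40, 42, 43, 50, 52]
q1, q2, q3 = quartiles(ages)
iqr = q3 - q1

print(q1, q2, q3, iqr)  # 31 37.5 43 12
```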


DATA SUMMARIES: BOX PLOT 


   o          <- outside values
   o
  ---         <- upper adjacent value (adjacent line)
   |          whisker
 +---+        <- 75th percentile (upper hinge)
 |---|  box   <- median
 +---+        <- 25th percentile (lower hinge)
   |          whisker
  ---         <- lower adjacent value (adjacent line)
   o          <- outside value
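
The labelled parts of the box plot can be computed directly. The minimal Python sketch below assumes the common 1.5 × IQR convention for the fences (the slide does not state which rule is used) and the median-of-halves rule for the hinges. The function name boxplot_summary is made up for illustration, and the data are the eight participant ages from the earlier median example.

```python
from statistics import median


def boxplot_summary(values, k=1.5):
    """Hinges, adjacent values and outside values, assuming fences at k * IQR."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1, q3 = median(ordered[:half]), median(ordered[-half:])
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    # Adjacent values are the most extreme observations still inside the fences.
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    return {
        "lower_adjacent": inside[0],
        "q1": q1,
        "median": median(ordered),
        "q3": q3,
        "upper_adjacent": inside[-1],
        "outside": [x for x in ordered if x < low_fence or x > high_fence],
    }


ages = [17, 19, 21, 22, 23, 23, 23, 38]
summary = boxplot_summary(ages)
print(summary)
```

For these ages the hinges are 20 and 23, so the fences sit at 15.5 and 27.5; the whiskers end at 17 and 23, and 38 is plotted as an outside value.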


DATA SUMMARIES: BOX PLOT 


[Figure: Box plot of the worldwide adolescent fertility rate, 2006.]


QUESTION: COMPARE THE BOX PLOTS


[Figure: Side-by-side box plots of fert_1997, fert_2000, fert_2002, fert_2005 and fert_2006.]